
Stack Exchange Data Dump




Published March 14, 2017


This is an anonymized dump of all user-contributed content on the Stack Exchange network. Each site is formatted as a separate archive consisting of XML files zipped via 7-zip using bzip2 compression. Each site archive includes Posts, Users, Votes, Comments, PostHistory and PostLinks. For complete schema information, see the included readme.txt.

All user content contributed to the Stack Exchange network is cc-by-sa 3.0 licensed, intended to be shared and remixed. We even provide all our data as a convenient data dump.
License: http://creativecommons.org/licenses/by-sa/3.0/
But our cc-by-sa 3.0 licensing, while intentionally permissive, does require attribution:
Attribution — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
Specifically the attribution requirements are as follows:
  1. Visually display or otherwise indicate the source of the content as coming from the Stack Exchange Network. This requirement is satisfied with a discreet text blurb, or some other unobtrusive but clear visual indication.

  2. Ensure that any Internet use of the content includes a hyperlink directly to the original question on the source site on the Network (e.g., http://stackoverflow.com/questions/12345)

  3. Visually display or otherwise clearly indicate the author names for every question and answer used

  4. Ensure that any Internet use of the content includes a hyperlink for each author name directly back to his or her user profile page on the source site on the Network (e.g., http://stackoverflow.com/users/12345/username), directly to the Stack Exchange domain, in standard HTML (i.e. not through a Tinyurl or other such indirect hyperlink, form of obfuscation or redirection), without any “nofollow” command or any other such means of avoiding detection by search engines, and visible even with JavaScript disabled.

For more information, see the Stack Exchange Terms of Service.
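As a rough illustration of the four attribution requirements above, the following Python sketch builds a plain HTML attribution line. The function name `attribution_html`, its parameters, and the `se-attribution` class are hypothetical; only the URL patterns are taken from the examples in the requirements. It is one possible way to satisfy the rules, not an official template.

```python
from html import escape


def attribution_html(base_url, question_id, title, author_id, author_slug, author_name):
    """Build a plain-HTML attribution line (hypothetical helper, not an official format)."""
    # URL patterns follow the examples given in the attribution requirements.
    question_url = f"{base_url}/questions/{question_id}"
    author_url = f"{base_url}/users/{author_id}/{author_slug}"
    # Direct links in standard HTML: no redirect, no rel="nofollow", no JavaScript needed.
    return (
        f'<p class="se-attribution">Source: '
        f'<a href="{escape(question_url, quote=True)}">{escape(title)}</a> '
        f'by <a href="{escape(author_url, quote=True)}">{escape(author_name)}</a>, '
        f'from the Stack Exchange Network</p>'
    )


snippet = attribution_html(
    "http://stackoverflow.com", 12345, "Example question", 12345, "username", "username"
)
print(snippet)
```

The links are emitted as static markup so that they remain visible with JavaScript disabled, as requirement 4 asks.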


Identifier stackexchange
Publicdate 2014-01-21 18:54:32
Mediatype data
Addeddate 2014-01-21 18:54:32
Creator Stack Exchange, Inc.
Date 2017-03-14
Year 2017
Year 2016
Year 2015
Year 2014
Contributor Stack Exchange Community
Licenseurl http://creativecommons.org/licenses/by-sa/3.0/
Backup_location ia905803_10

Reviews

Reviewer: coder_chenzhi - 3 stars - March 24, 2017
Subject: Incomplete
Why no softwareengineering.stackexchange?
Reviewer: fturco - 5 stars - March 5, 2017
Subject: Better filenames
Let me suggest adding dates in YYYY-MM-DD or YYYYMMDD format to the filenames. For example: stackexchange-20161215/3dprinting.stackexchange.com.7z or stackexchange/3dprinting.stackexchange.com.20161215.7z
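The two naming patterns suggested in this review could be generated along these lines. This is only a sketch: `dated_names` is a hypothetical helper, and the archive itself does not use either scheme.

```python
from datetime import date


def dated_names(site, dump_date):
    """Return both suggested dated paths for one site archive (hypothetical helper)."""
    stamp = dump_date.strftime("%Y%m%d")  # YYYYMMDD, as in the review's examples
    return (
        f"stackexchange-{stamp}/{site}.7z",   # date in the directory name
        f"stackexchange/{site}.{stamp}.7z",   # date in the file name
    )


names = dated_names("3dprinting.stackexchange.com", date(2016, 12, 15))
print(names[0])
print(names[1])
```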
Reviewer: Mooash - - June 28, 2016
Subject: Torrent out of date
Looks like the torrent is out of date again, I'm stuck at 800MB whilst it continually tries to download files.
Reviewer: amz3 - 4 stars - March 31, 2016
Subject: Good but missing data
The data is almost complete, except that it is missing badge information (which badges are gold, silver, etc.) and what they mean.
Reviewer: Greg Lindahl - 5 stars - March 18, 2016
Subject: .torrent is fixed
The former limitation of 25 gigabytes for a torrent has been relaxed for this item, and the torrent is working again.

If you want to parse these large XML files, you need to use a streaming parser. I'm not surprised that an XML editor would have problems with a 40 gigabyte XML file! Most XML parsing software libraries have a streaming option.
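A minimal sketch of the streaming approach in Python, using the standard library's `xml.etree.ElementTree.iterparse`. It assumes each record is a `row` element whose data is stored in attributes under a single root element; the attribute names in the sample (`Id`, `PostTypeId`, `Score`) are illustrative, and the readme.txt included in the dump is the authority on the real schema. The helper name `iter_rows` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield each <row> element's attributes as a dict without loading the whole file."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)
            root.clear()  # detach processed rows from the root so memory stays bounded


# Tiny in-memory stand-in for a real file; the layout is an assumption.
SAMPLE = io.BytesIO(
    b'<?xml version="1.0" encoding="utf-8"?>'
    b"<posts>"
    b'<row Id="1" PostTypeId="1" Score="5" />'
    b'<row Id="2" PostTypeId="2" Score="3" />'
    b"</posts>"
)

rows = list(iter_rows(SAMPLE))
print(len(rows), rows[0]["Id"])
```

For a real dump file, pass the path (or an open binary file object) to `iter_rows` and process the rows one at a time instead of collecting them into a list.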
Reviewer: vishal14 - - March 5, 2016
Subject: How to open posts.xml file?
I have downloaded and extracted the posts.xml file for stackoverflow. The file is around 40 GB and I am not able to open it in XML editors. Can someone please suggest how to open or parse this huge file?
Reviewer: shankar321 - 1 star - January 11, 2016
Subject: What sense do we make of the files?
I see over 300 files, whereas there are only 6 XML files (large ones) as far as I remember.

I cannot download the torrent version either - can someone help make sense of the files?

Shankar
Reviewer: jlewi - - December 30, 2015
Subject: torrent out of date
The bit torrent link:
https://archive.org/download/stackexchange/stackexchange_archive.torrent

appears to link to an older version of the data dump. For stackoverflow the latest post I saw was from 2014.

However, the .zip file
https://archive.org/compress/stackexchange/formats=7Z&file=/stackexchange.zip

appears to have data up to August 2015.
Reviewer: cirosantilli - - November 21, 2015
Subject: The current dump is old, from 2014-09
http://meta.stackexchange.com/questions/264565/when-was-the-last-data-dump-upload-to-archive-org
Reviewer: Jesse_W - - November 6, 2015
Subject: This hit the shuffle.php bug
And that broke the webseeds (see https://archive.org/post/1047899 ). This review should hopefully fix the issue.
Reviewer: D1Doris - 4 stars - September 4, 2015
Subject: stackexchange_meta.sqlite
StevenLJohnson, I have the file and can open it without any problems. What exactly do you mean by "unable to access"? If you mean that you don't have the file, I can send it to you.
Reviewer: StevenLJohnson - 3 stars - August 31, 2015
Subject: Missing File
I am unable to access the file stackexchange_meta.sqlite

Does anyone know of a source for this file?
Reviewer: xcombelle - - July 21, 2015
Subject: For those who can't get BitTorrent working
There are some reasons for it: https://archive.org/about/faqs.php#Archive_BitTorrents
Reviewer: timiblossom - 2 stars - July 2, 2015
Subject: Bittorrent download broken
It stopped at around 70% for a couple of days and could never move forward.
Reviewer: sathvik - 5 stars - May 22, 2015
Subject: Thanks for sharing
Thanks for sharing the community data. It will greatly benefit research groups.
Reviewer: dmpetrov - 5 stars - May 3, 2015
Subject: April data
Great data set. Thank you for sharing.

I see only March data. How can I get April data?
What about January and February?

Thanks,
Dmitry
Reviewer: alisa1 - - April 10, 2015
Subject: Resolved!
I also tried a couple of times, and it failed at the same point.
But then I tried while logged in, and I was able to download the whole file :-)
Reviewer: big_t_dub - 1 star - April 3, 2015
Subject: 70%
stuck at 70.7% download complete via utorrent- arg!

this should be made avail via ftp!!!
Reviewer: Ihor Bobak - - March 27, 2015
Subject: File is broken
At the top right corner of this page there is a link to zip archive. I've downloaded it twice (on different machines, in different countries). The file was always broken.

The torrent gets stuck at 70.8%.

Can anyone help to get this file?
Reviewer: gnijuohz - 3 stars - March 25, 2015
Subject: No seed?
It stopped at around 70%.
Reviewer: klitzkrieg - 3 stars - March 24, 2015
Subject: Seeds for 3/16/15 version?
Everybody's stuck at 70.7%
Reviewer: Nemo_bis - 5 stars - February 7, 2015
Subject: Thanks and tests
Thanks for the September update, eager to see the next one. Did someone try importing this data into a StackExchange instance?

Fun to see how small the whole SE network is after all, only a few GB compressed. Wikimedia project dumps compress very well too, but they're still much bigger (while fitting on a common hard disk anyway!).
Reviewer: shamsazad - 4 stars - August 19, 2014
Subject: Latest Dump.
When will the latest dump from stackoverflow be posted here?
Reviewer: Jenson555 - 5 stars - July 26, 2014
Subject: Really Cool
This is awesome stuff. Cheers :)
DOWNLOAD OPTIONS
7Z
In Collection
Community Media
Uploaded by
Stack Exchange
on 1/21/2014
Views
92,743
Favorites
29
Reviews
24
SIMILAR ITEMS (based on metadata)
Data Collection
collection
415,041
ITEMS
87.7M
VIEWS
collection
eye 87.7M
The Archive Team Just In Time Grabs
by Jeff Atwood, Stackoverflow.com
web
eye 663
favorite 0
comment 1
favoritefavorite ( 1 reviews )
Community Media
texts
eye 11
favorite 0
comment 0
Community Media
image
eye 99
favorite 0
comment 0
Community Media
data
eye 36,571
favorite 1
comment 0
Cuil Crawl Data
collection
0
ITEMS
22M
VIEWS
collection
eye 22M
Community Media
texts
eye 49
favorite 0
comment 0
Community Media
by SANTA BARBARIAN
image
eye 48
favorite 0
comment 0
Community Media
texts
eye 19
favorite 0
comment 0